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Abstract. We present a new method for inferring species trees from multi-copy 
gene trees. Our method is based on a generalization of the Robinson-Foulds 
(RF) distance to multi-labeled trees (mul-trees), i.e., gene trees in which multiple 
leaves can have the same label. Unlike most previous phylogenetic methods us- 
ing gene trees, this method does not assume that gene tree incongruence is caused 
by a single, specific biological process, such as gene duplication and loss, deep 
coalescence, or lateral gene transfer. We prove that it is NP-hard to compute the 
RF distance between two mul-trees, but it is easy to calculate the generalized RF 
distance between a mul-tree and a singly-labeled tree. Motivated by this observa- 
tion, we formulate the RF supertree problem for mul-trees (MulRF), which takes 
a collection of mul-trees and constructs a species tree that minimizes the total RF 
distance from the input mul-trees. We present a fast heuristic algorithm for the 
MulRF supertree problem. Simulation experiments demonstrate that the MulRF 
method produces more accurate species trees than gene tree parsimony methods 
when incongruence is caused by gene tree error, duplications and losses, and/or 
lateral gene transfer. Furthermore, the MulRF heuristic runs quickly on data sets 
containing hundreds of trees with up to a hundred taxa. 



1 Introduction 

With the development and spread of next generation sequencing technologies, there is 
great interest in incorporating large genomic data sets into phylogenetic inference. One 
challenge for such phylogenomic analyses is that genes sampled from the same set of 
species often produce conflicting trees l2T1l . Some of the incongruence may be due to 
errors in the phylogenetic analyses [ 33 1 . The discordance also may reflect evolutionary 
events such as recombination, gene duplication, gene loss, deep coalescence, and lateral 
gene transfer (LGT) [4 1 1 15 20 21 25 ]. Indeed, under certain conditions the most likely 
gene tree topology to evolve along a species tree will differ from the species tree ifTOl . 
Thus, in order to construct phylogenetic hypotheses from genomic data, it is necessary 
to address the incongruence among gene trees. 

Approaches to inferring species from conflicting gene trees typically use a model of 
gene evolution that can reconcile the gene tree and species tree topologies. In practice, 
these models are usually based on a single evolutionary mechanism, such as duplication 
and loss or deep coalescence. Although these models greatly simplify the true processes 
of genome evolution, more complex and realistic models can quickly become unwieldy, 
making it hard or impossible to analyze large genomic data sets. In this paper, we take 
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a step back and approach the question of finding a species tree for a given collection of 
gene trees though a method that is based on a tree distance metric and does not imply 
any specific evolutionary mechanism. 

Previous Work. Existing methods for inferring species trees from collections of gene 
trees can be divided into two broad categories: non-parametric methods based on gene 
tree parsimony (GTP), and likelihood-based approaches 121181191 . GTP methods take a 
collection of discordant gene trees and try to find the species tree that implies the fewest 
evolutionary events. GeneTree (24), DupTree (35) , and DupLoss |5 ] seek to minimize 
the number of duplications or duplications and losses. GeneTree |24], Mesquite (2D . 
PhyloNet (37), and the method of \5\ minimize deep coalescence events. The Subtree 
Prune and Regraft (SPR) supertree method (36) is based on minimizing the number of 
LGT events. Some of these methods are quite fast, enabling the analysis of very large 
data sets, but errors in the gene trees can mislead GTP analyses 181171281 . Also, in some 
cases GTP methods may be statistically inconsistent (34) . Many of the likelihood-based 
methods use coalescence models to reconcile gene tree topologies 1181191 . Although 
such likelihood-based approaches have a firm statistical basis, they often are computa- 
tionally expensive. 

While all the existing methods differ widely in their details, at a high level, except 
[ 2 ], they all are based on potentially restrictive assumptions about the source of discor- 
dance among gene trees. 

Our Contributions. We present a species tree inference technique that is not linked to 
any specific mechanism of gene tree discordance and has the scalability and accuracy 
expected for genome-wide analyses for many taxa. Our method takes as input a col- 
lection of multi-labeled gene trees (mul-trees), trees where multiple leaves can have the 
same label, and finds a species tree at minimum "distance" to the input trees. The ability 
to use mul-trees as input, instead of being restricted to single copy genes, allows this 
method to incorporate the wealth of genomic data from multi-copy genes into phyloge- 
netic inference, not only single-copy genes. Our distance measure is a generalization of 
the Robinson-Foulds (RF) distance to mul-trees. The RF distance has been useful as a 
supertree method for singly-labeled input trees |6 9], and in the singly-labeled setting, 
the distance based approach may be statistically consistent (3D . 
Our contributions are as follows: 

- We study the problem of computing the RF distance between two mul-trees, and 
show that it is NP-hard (Section [2}. 

- We formulate a RF supertree problem for mul-trees, which we call MulRF, that 
takes a collection of mul-trees as input and constructs a supertree that is at mini- 
mum RF distance from each input mul-tree (Section [3}. A key component of this 
approach is a simple and efficient technique to compute the RF distance between 
an input mul-tree and a singly-labeled species tree. (Note the contrast with the 
previously-mentioned NP-hardness result.) 

- We provide a fast heuristic algorithm for the MulRF problem (Section]?]). Heuristics 
are needed for this problem because it is NP-hard. 

- We implemented the MulRF algorithm and performed experiments on complex 
gene tree simulations (Section [5]). 



Inferring Species Trees from Gene Trees Using the Robinson-Foulds Distance 3 

c d c d 

Fig. 1. The contraction of edge {u, v} in the first tree produces the second tree; conversely, the 
refinement of vertex u in the second tree produces the first tree. 

Simulation experiments allow us to evaluate the accuracy of our method by com- 
paring it against the true species tree, something that cannot be done on real data. We 
compared the supertrees constructed by MulRF and GTP methods that consider only 
duplication |[35l, duplication and loss |5], and only LGT l36l with the true species 
trees. Likelihood-based methods were not considered because the simulated gene trees 
were comparatively large in size for these methods and no likelihood-based phyloge- 
netic method deals explicitly with duplication and loss or LGT . In all experiments, 
MulRF produced trees that are more similar to the true species trees than those ob- 
tained by other three methods. Further, our algorithm ran quickly on moderate- size data 
sets, finishing in under two minutes on data sets containing 300 gene trees evolved over 
100 taxon species trees, suggesting it is scalable for large-scale phylogenomic analyses. 

2 Preliminaries 

A phylogenetic tree or tree is an unrooted, leaf-labeled tree in which all the internal 
vertices have degree of at least three |29l . The leaf set of T is denoted by C(T). The set 
of all vertices of T is denoted by V(T) and the set of all edges by E(T). The set of all 
internal vertices of T is I(T) := V(T)\C(T). A tree is binary if every internal vertex 
has degree three. Let U be a subset of V(T). We denote by T[U] the minimum subtree 
of T that connects the elements in U. The restriction of T to U, denoted by T\u, is the 
phylogenetic tree that is obtained from T[U] by suppressing all vertices of degree two. 

Two trees T\ and T 2 are isomorphic if there exists a bijection r : V(T\) — )> V{T 2 ) 
such that {u, v} e E(T X ) if and only if {r(u), r(v)} e E(T 2 ) for all {u, v} e ( y( 2 Tl) ). 

The contraction of an edge in a tree collapses that edge and identifies its two end- 
points. The refinement of an unresolved vertex (i.e., an internal vertex with degree 
greater than three) expands that vertex into two vertices connected by an edge. Con- 
traction and refinement can be viewed as inverses of each other (Fig.[T]). 

The Robinson-Foulds (RF) distance between two trees T\ and T 2 , denoted by RF(T\ , 
T2), is the minimum number of contractions and refinements necessary to transform T\ 
into a tree isomorphic to T 2 [27]. The RF distance between two trees can be equiva- 
lently defined via splits. A split A \ B is a bipartition of the leaf set of a tree; A and B 
are the parts of split A\B. The set of all splits induced by the internal edges of a tree T 
is denoted by E(T). Now for T x and T 2 <27l 

RF(T U T 2 ) := |(r(Ti)V£(r 2 )) U (r(T 2 )\i7(ri))|. 

Two trees T x and T 2 are isomorphic if E(T{) = U(T 2 ) W\ page 44]. 

A phylogenetic mul-tree or mul-tree, is a tuple T = (T, M. cp) consisting of an 
unrooted tree T, a set of labels M, and a surjective labeling function ip : C(T) — » M 
that maps each leaf of T with a label in M. Informally, a mul-tree is simply a phylogeny 
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in which multiple leaves can have the same label (see Fig. [2]). For any label t G M, 
(p~ x (£) is the set of all leaves labeled L If cp is a bijection, the corresponding mul-tree is 
just a (singly-labeled) tree. In this paper, we use the traditional notation for a tree when 
the given mul-tree is clearly a tree. 

The concepts introduced above for unrooted trees naturally extend to mul-trees. 
For example, a mul-tree T = (T, M. cp) is binary if T is binary. Two mul-trees T\ = 
(Ti,M,(pi) and T2 = (T2, M, (p 2 ) are isomorphic if T\ and T2 are isomorphic under 
bijection r : V(T t ) V(T 2 ) such that <pi(u) = ^OH) for all u G £(Ti). 




Fig. 2. Two mul-trees that induce the same set of splits but are not isomorphic. 

The contraction and refinement based RF distance metric naturally extends to mul- 
trees fT3l . However, unlike singly-labeled trees, it is possible for two mul-trees 7i and 
T2 to satisfy i^(7i) = ^(T 2 ) and yet not be isomorphic (see Fig. [2]). Thus, the RF 
distance between two mul-trees cannot be computed by splits. Ganapathy et al. gave 
a worst-case exponential time algorithm for computing the RF distance between two 
mul-trees 1 13 ]. The next result suggests that a polynomial time algorithm is unlikely. 

Theorem 1. Computing the RF distance between two mul-trees is NP-hard. f\ 



3 MulRF Supertrees 

A profile is a tuple of mul-trees V : = (7i , T 2 , • • • , Tk ) , also called input trees, where % = 
(Ti, Mi,ipi) for each i G {1, ...,&}. A supertree on P is a singly-labeled phylogenetic 
tree S such that £(5) = Ui=i ^ e wr i te n t0 denote the total number of 

distinct leaves in the profile. In this paper, we assume that the size of each input mul-tree 
differs only by a constant factor from the size of the resulting supertree. 

We extend the notion of RF distance to the case where C(T\) C C(T2) by letting 
RF(T U T 2 ) := RF(T 1 ,T 2l £ iTl) ). We define the RF distance from a profile V to a 
supertree S for V as iLF(P, 5) := Y^rev RF (T, S). 

Let S('P) be the set of all binary supertrees for V. 

Problem 1 (RF Supertree for MUL-Trees (MulRF) ). 

Input: A profile V = (7i, 72, Tk) of unrooted mul-trees. 

Ow^wf: A supertree for P such that RF{P, S*) = mm SeB(v) RF(P, S). 

The MulRF problem is NP-hard even when all the input mul-trees are singly-labeled 
trees on the same leaf set l23ll . In fact, as stated in Theorem [T] just computing the RF 
distance between two mul-trees is hard. Nevertheless, we now show that it is straight- 
forward to compute the RF distance between an input mul-tree and a supertree. 



3 The proofs of this and other results are in the Appendix. 
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Let T = (T, M, ip) be an input mul-tree and S be a supertree, where M C £(5). 
The extended supertree is the mul-tree 5 constructed from 5 by replacing each a G 
C(S) by an internal node connecting to k leaves labeled with a, where k := \(p~ 1 (a)\ > 
1. See Fig. [3] A/w// differentiation of T is a leaf labeled tree T such that T and T are 
isomorphic. 

Let T = (T, M, <p) and S = (T', M', <p') be two unrooted mul-trees. Two full 
differentiations T and S of T and S, respectively, are consistent if for each a G M D 
M', ri((^ _1 (a)) = T2((^ /_1 (a)), where T and T are isomorphic under bijection n : 
V(T) V(T) and T' and S are isomorphic under bijection r 2 : V(T') -> V(S). For 
instance, a consistent full differentiation can be obtained by relabeling each of the k 
copies of each leaf label a by a\ , . . . , in both the trees. 

T S S 




Fig. 3. Input mul-trees T and the supertree S. The extended supertree S is also shown. 

Theorem 2 (HSD.LetT and S be two mul-trees. Then, RF(T,S) = min{i?F(T, S) : 
T and S are mutually consistent full differentiations ofT and S, respectively}. 

Theorem 3. Let T be an input mul-tree and S be the extended supertree. Then, all 
mutually consistent full differentiations ofT and S give the same RF distance. 

In short, the RF distance between an input mul-tree and a supertree can be computed 
by 1) extending the supertree, 2) producing one consistent full differentiation of the two 
mul-trees, and 3) applying the split based formula to compute the RF distance. 



4 Solving the MulRF Problem 

Our local search heuristic for the MulRF problem starts with an initial supertree and 
explores the space of possible supertrees in search of a locally optimum supertree; i.e., 
a tree whose score is minimum within its "neighborhood". The neighborhood is defined 
in terms of the Subtree Prune and Regraft (SPR) operation (TJ. An SPR operation on 
an unrooted, binary tree T cuts any edge, thereby pruning a subtree t, and then regrafts 
t by the same cut edge to a new vertex obtained by subdividing a pre-existing edge in 
T — t (Fig. [4]). The set of all trees obtained by the application of a single SPR operation 
on T is called the SPR neighborhood of T, and is denoted by SPRt- The size of this 
neighborhood is (9(n 2 ). 




Fig. 4. A schematic representation of the SPR operation. 
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Problem 2 (SPR Search). 

Input: A profile P = (7i,72,---,7fc)of unrooted mul-trees and a binary supertree S for 
V. 

Output: A tree S* £ SPR S such that RF(V, S*) = mm s >eSPR s RF(V, S f ). 



In Section 4.2 we present an algorithm for the SPR search problem that runs in time 
0{n 2 k). The algorithm relies on results from (9), which characterize the RF distance 
between unrooted trees in terms of least common ancestors in rooted versions of those 
trees. These properties enable us to update the RF distance quickly after an SPR oper- 
ation has been applied to one of the trees. For completeness, we briefly review these 
results in the next subsection. For a full discussion with proofs, see . 



4.1 Robinson-Foulds Distance and Least Common Ancestors 

A rooted phylogenetic tree T has exactly one distinguished vertex rt(T), called the 
root. The root is a degree-two vertex if the tree is binary. A vertex v of T is internal if 
v £ V(T)\(C(T) U rt(T)). The set of all internal vertices of T is denoted by J(T). We 
define to be the partial order on V(T) where x y if y is a vertex on the path 
from rt(T) to x. If {x, y} £ E(T) and x y, then y is the parent of x and x is a child 
of y. The least common ancestor (LCA) of a non-empty subset L C V(T), denoted by 
LCAt(£), is the unique smallest upper bound of L under ^j. 

Let T v denote the subtree of T rooted at vertex v £ V(T). For each node v £ /(T), 
Cj(v) is defined to be the set of all leaf nodes in T v . Set Cj(v) is called a cluster. Let 
H(T) denote the set of all clusters of T. The RF distance between rooted trees T, § over 
the same leaf set is defined as l27l 

RF(T,§) := \(H(T)\H(§)) U («(S)\7£(T))|. 

Let T and S be the trees that result from rooting T and S at the branches incident 
on some arbitrarily-chosen but fixed taxon r £ £(T) fl £(S) (Fig. [5]). 




Fig. 5. Tree T with leaf set {a, 5, c, d, e}. The rooted tree T with r = a is also shown. 

Lemma 1 (|9 |). Le£ T S be two unrooted phylogenetic trees with C(T) = C(S), 
then RF '(T,S) =i?F(T,S). 

We extend RF distance to the case where £(T) C £(§) in the same way as for 
unrooted trees. That is, RF(T, S) := RF(T, $\c(T))> where S|£(f) is the rooted phylo- 
genetic tree obtained from §[£(T)] by suppressing all non-root degree-two vertices. 

We now show how to compute the RF distance in this general setting, without ex- 
plicitly building §|£(t). We need two concepts. Let v £ I(E>). The restriction of C§(v) 
to C(T) is Cj(v) := {w £ C(S V ) : w £ £(T)}. The vertex function /§ assigns each 
u £ /(T) the value = \U\, where C7 := {v £ J(S) : C T (u) = C T (v)}. Observe 
that if £(S) = £(T), then for all u £ /(T), < 1. 
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Lemma 2 (|9|). RF(T,S) = |£(T)| - |7(T)| + 2\J=k\ - 2, where T s := {u G I(T) : 
fs(u) = 0}. 




b c |_aj e d f bd 
Fig. 6. The LCA mapping from S to T. Vertex a in S is mapped to null as a £ £(T). The internal 
vertices of T are labeled with the values of the vertex function. 

We now describe a 0(n) -time algorithm to compute the initial vertex function for S 
relative to T, along with the RF distance between these two trees. The algorithm relies 
on LCAs. For S and T, the LCA mapping Mgj : V(§) V(T) is defined as 



LCA t (Ct(w)), ifC T (i/)^0; 



null, 



otherwise. 



See Fig. [6] 



Lemma 3 (|9|). For all u e J(T), fg(u) = \B\, where B := {v e I(S) : Mgj(v) = 
u and \Cj(u)\ = \Cj(v)\}. 

The LCA computation for T can be done in 0(n) time, and the LCA mapping from 
S to T can be done in 0(n) time | 7 ] in bottom-up manner. Further, from Lemmas [2] and 
[3] we can compute the RF distance between § and T in 0(n) time as well. 



4.2 Solving the SPR Search Problem 

Let T = (T, M, (p) be an arbitrary mul-tree in V. We now show how to compute 
the RF distance from T to each tree in the SPR5 neighborhood in linear time of the 
size of the neighborhood. Let S be the supertree S after extending for T. Let T and 
S be any two mutually consistent full differentiations of T and 5, respectively. By 
Theorem |3j computing the RF distance between an input mul-tree T and all trees in 
the SPR neighborhood of an extended supertree S reduces to finding the RF distance 
between T and each tree in the SPR neighborhood of S. 

Suppose an SPR operation on S cuts the edge e = y}, and that X, Y are the 
subtrees of S — e containing x, y, respectively. Suppose subtree Y is pruned and re- 
grafted by the same cut edge to a new vertex obtained by subdividing an edge in X. 
The degree-two vertex x is suppressed and the new vertex is denoted by x. Observe that 
there are 0(n) possible edges in X to regraft Y. We perform regrafts in an order that 
leads to a constant time RF distance computation for each successive regraft. 

Observation 1. For Z e {X, Y}, if M n C(Z) = 0, then RF(S' , T) = RF(S, T) for 
each S r obtained from S by redrafting Y on any edge in X. 
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We begin by regrafting Y at an edge incident to a leaf in X. Let S and S denote, 
respectively, the tree that results from performing the prune-and-regraft and the full 
differentiation of this result tree. We compute the RF distance between T and S using 
the algorithm described in the previous section. This method works by computing the 
RF distance between the rooted trees T and 8 obtained by rooting T and S at any leaf 
labeled by an element of M n C(X). (Note that, by Observation 1, if M n C(X) = 0, 
then T's distance from S is same as S.) The algorithm also computes the LCAs for T 
and the LCA mapping from 8 to T. 

We perform the remaining regrafts of Y on edges in X by iterating through the 
vertices of X, starting from a leaf and exploring as far as possible along each branch 
before backtracking. The k th regraft is performed on the edge between the k th and 
k + I st vertices in this iteration. Let us denote this ordering of edges by H. See Fig. [7] 
Observe that each two distinct consecutive edges in N are adjacent. We will show that, 
after the initial RF distance computation for S, we can compute in constant time the RF 
distance for the result of regrafting on each successive (adjacent) edges in 

m, 

m& \ Fig. 7. A tree with a subtree re- 

x/a M " 2 grafted at edge {a, b}. One it- 

5>q eration of vertices in the tree is 

m 5 c mi, a, mi , a, 6, c, mz , c, m± , c, b, d, m$ , 

/\ d, me , d, 6, a, mi . The resulting ordering ft 

m{ is {mi, a}, {a,m 2 }, {a, mi}. 

Beginning with S, each S' G SPRs helps in computing the RF distance of the 
next tree in the above regraft order. Assume that S f G SPRs results from regrafting 
Y at edge {a, b} in X as shown in Fig. [7] Let the rooted tree obtained after extending 
and differentiating S f be denoted by The LCA mapping and RF distance have been 
computed for S r . Let S" G SPRs denote the tree obtained by regrafting Y on edge 
{6, c} in X and the rooted counterpart of S" is 8". 

Next, we find the vertices of 8" whose LCA mapping Ms» j has changed as a result 
of the SPR operation. Based on the topology of there are three cases: 

1. x is parent of b and b is parent of c. For all t G 7(§ // )\{x, b}, A4§»j(t) = A4§>j(t). 
Further, M§»j(b) := Ms',y(x), and M$»j(x) := LCA(A1§/,t(c), M&j(y)). 

2. b is parent of c and x. For all t G 7(§ // )\{x}, Mg»j(t) = Mg>j(t). Further, 
M$» iT (x) := LCA(Ms> jT (c),Ms> jT (y)). 

3. b is parent of x and c is parent of b. For all t G 7(8 // )\{6, x}, Mg"j(t) = M§'j(t). 
Moreover, M§»,t(x) := M§',j(b), and Ms",j(b) •= LCA(A^s / ,t(^), Ms',t(o>))> 

Since we can check in constant time which one of the above three cases holds, 
the LCA mappings can be updated in constant time too. Let H be a set {u G I(T) : 
h" (u) 7^ /§' ( u )}- Set H can be computed in constant time. Observe that H has at most 
four vertices. Let G denotes the set {w G H : fs>(w) = 0, but f§>r(w) > 1}, and L 
denote the set {w G H : f§>(w) > 1, but fs»(w) = 0}. 

Lemma 4. RF(S",T) = RF (S',T) -2\G\ + 2\L\. 

Thus, after the initial regraft of Y at a leaf in X, we can compute in constant time 
the RF-distance between T and the supertree that results from each subsequent regraft. 
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Lemma 5. For each {x,y} G E(S), where X and Y are two resulting subtrees con- 
taining x and y, respectively. The RF distance for the set of trees obtained by regrafting 
X (resp. Y) on each edge in Y (resp. X) can be computed in 0{n) time. 

Theorem 4. The SPR Search problem can be solved in 0(n 2 k) time. 

5 Experimental Evaluation 
5.1 Method 

Simulated data sets. We generated model species trees using the uniform speciation 
(Yule) module in the program Mesquite l22l . Two sets of model trees were generated: 
i) 50 taxa trees of height 220 thousand years (tyrs), ii) 100 taxa trees of height 440 tyrs 
(note that the dates are relative; they do not have to represent thousands of years). Each 
data set had 20 model species trees. We evolved 150 and 300 gene trees for each 50- and 
100-taxon model species tree, respectively. We used Arvestad et al.'s | 3 ] duplication- 
loss model to evolve gene trees within the model tree. We applied LGT events on the 
evolved gene trees, using the standard subtree transfer model of LGT. One LGT event 
causes the subtree rooted at a vertex c to be pruned and regrafted at an edge (a, b), 
where a and b together are not in the path from the root (of the tree) to c. We used gene 
duplication and loss (D/L) rate of 0.002 events/gene per tyrs and LGT rate of 2 events 
per gene tree. In other words, a gene tree can have to 2 LGT events. 

We evolved gene trees based on four evolutionary scenarios: i) no duplications, 
losses, or LGT (called none), ii) D/L rate 0.002 and no LGT (called dl), iii) no du- 
plication or loss, and LGT rate 2 (called Igt), and iv) D/L rate 0.002 and LGT rate 2 
(called both). The parameter values for each simulation are called the model condition. 
We deleted to 25% of the taxa (selected at random) from each gene tree to represent 
missing data, which is common in almost all phylogenomic studies.. For each gene tree, 
we used Seq-Gen l26l to simulate a DNA sequence alignment of length 500 based on 
the GTR+Gamma+I model. The parameters of the model were chosen with equal prob- 
ability from the parameter sets estimated in lf]~2l on three biological data sets l32l . We 
estimated maximum likelihood trees from each simulated sequence alignment using 
RAxML [30], performing searches from 5 different starting trees and saving the best 
tree. We rooted each estimated gene tree at the midpoint of the longest leaf-to-leaf path 
before the species tree construction. 

Species tree estimation. We estimated species trees via GTP minimizing only the num- 
ber of duplications (Only-dup) l35lL GTP minimizing duplications and losses (Dup- 
loss) 0, GTP minimizing LGT events (SPR supertree or SPRS for short) (36), and 
the MulRF heuristic. Both Only-dup and Dup-loss were executed with their default set- 
tings, including a fast leaf-adding heuristic for initial species tree construction. SPRS 
was run with 25 iterations of the global rearrangement search option. For 50-taxon data 
sets, it calculated the exact rSPR distance if it was 15 or less, and otherwise it esti- 
mated the rSPR distance using the 3 -approximation. For the 100-taxon data sets, we 
used the 3 -approximation of the rSPR distance. SPRS does not allow mul-trees as in- 
put. Therefore we only ran it on none and Igt data sets. Experiments were performed 
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on the University of Florida High Performance Computing test nodes with 8 to 24 cores. 



Num. Taxa 


Sets 


Only-dup 


Dup-loss 


SPRS 


MulRF 




none 


< Is 


2s 


8h 34m 32s 


3s 


50 


Igt 


< Is 


2s 


8h 30m 30s 


2s 




dl 


< Is 


3s 


NA 


6s 




both 


< Is 


3s 


NA 


6s 




none 


9s 


37s 


21h 34m 25s 


58s 


100 


Igt 


lis 


49s 


19h 6m 9s 


51s 




dl 


9s 


30s 


NA 


lmlls 




both 


lis 


37s 


NA 


lm 15s 



Table 1. Running time for species tree estimations 



Performance evaluation. We report the average topological error (ATE) for each model 
condition. This is the average of the normalized RF distance (dividing the RF distance 
by number of internal edges in both trees) between each of the 20 model species trees 
and their estimated species trees. An ATE of indicates that two trees are identical, and 
an ATE of 100 indicates that two trees share no common splits. We also compared the 
number of gene duplications estimated by Only-dup and Dup-loss and losses estimated 
by Dup-loss with the actual number of these events in each gene tree simulation. 



5.2 Results 

Both Dup-loss and Only-dup overestimate duplications for sets dl and both in both 50- 
and 100-taxon model trees (Fig.[8ja,b)). They also imply many duplications in the none 
and Igt data sets, where the simulations included no duplications. Similarly, Dup-loss 
overestimates losses for sets dl and both and also erroneously estimates losses for sets 
none and Igt (Fig.[5tc,d)). 



50 taxa 



100 taxa 



■ Actual 
Only-dup 
B Dup-loss 



z 



Actual 
Dup-loss 





Sets 



igt 



Fig. 8. Graphs a-b shows duplications estimated by Only-dup and Dup-loss, and Graphs c-d losses 
estimated by Dup-loss, against the actual number of these events in gene trees, for all model 
conditions; means and standard errors are shown. 
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For each set of 50- and 100-taxon model trees, the MulRF species trees are more 
accurate than those produced by the other three methods. For example, the ATE rate of 
MulRF is 16.75% to 39.91% lower than the method of lowest ATE rate among other 
three methods (Fig. [9]). 

50taxa lOOtaxa 




none tgt dt both none tgt 



Fig. 9. Average topological error (means with standard error bars) for species tree constructed by 
Only-dup, Dup-loss, SPRS, and MulRF method, for all model conditions. 

In order to examine how Only-dup, Dup-loss, and SPRS methods perform when the 
process of gene tree evolution only includes events that these methods assume to be the 
source of discordance, we simulated gene trees that using a model that includes only 
duplication and loss, or LGT. While SPRS could not be tested on the former, Only-dup 
and Dup-loss had high ATE rate (indicating low accuracy) on the latter. 

6 Conclusion 

We presented a new approach for inferring species tree from incongruent gene trees 
that is not based on potentially restrictive assumptions about the causes of the conflict 
among gene trees. This approach is appealing for real, genomic data sets, in which many 
processes such as deep coalescence, recombination, gene duplications and losses, and 
LGT, as well as phylogenetic error likely contribute to gene tree dischord. In simula- 
tion experiments, the MulRF method estimated species trees more accurately than other 
GTP methods, and it appears to be relatively robust to the effects of phylogenetic error, 
gene duplication and loss, and LGT. In addition, the MulRF method is fast, estimating 
100-taxon species trees from hundreds of gene trees in under two minutes. One reason 
for this strong performance may be the underlying unrooted metric. The advantage of an 
unrooted metric compared to a rooted one, like those used in the other supertree meth- 
ods, has been well-studied in the context of RF supertrees for singly-labeled trees l9l . 
Further tests are needed to characterize the performance of MulRF methods under dif- 
ferent evolutionary scenarios. Another future direction will be to incorporate estimates 
of gene tree uncertainty into the supertree analysis by weighing the splits differently 
when computing the RF distance. 
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Appendix 

Computing RF Distance between two mul-trees is NP-Complete 

The proof relies on a reduction from the following NP-complete 1 14 ] problem. 

Problem 3 (Exact Cover by 3 -Sets (X3C)). 

Input: S := {si, s n }, where n = 3q, and C := {Ci,...,C m } such that d — 

Output: Are there exist sets d x , Ci q such that Uj=i ^ = SI 

Note that X3C remains NP-complete ifTSTl even when each element of S occurs in 
exactly three subsets in C, thus m = n = 3q. We take this version of X3C for reduction. 

Given an instance for the X3C problem, we construct two mul-trees T\ and T2 
such that transforming from T\ into T2 (or vice versa) requires k (to be specified later) 
contractions and refinements if and only if an exact cover of S exists. The construction 
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is as follows. For each G S, we construct two rooted binary trees T and T' that take a 
"large" number of contractions and refinements to transform into each other. Let k and 
t be two positive integers such that k + 2 > n 2 and k + 2 = 2 £ . Tree T and T" have 
fc + 2 leaves. Tree T' has the same topology as T, but for each cherr}j^](x, y) in T, x and 
2/ are in different subtrees T' u and in T' ', where it and u are two children of rt(T f ). 
For each G 5, corresponding trees T and T' have unique leaves (see Fig.fTo}) 






be d e f & h aceffbdfh 
Fig. 10. Two possible trees T and T' on 8 leaves with RF distance 12. 

Lemma 6. RF(T,T') = 2k. 

Proof. RF(T,T') = 2\U{T)\H{T% since T and V are binary trees. T and T' are 
binary trees on k + 2 leaves, thus H(T) = H{T') = k. Thus it suffices to show that 
no cluster in T matches any cluster in T' . Let v G I(T), the corresponding cluster 
C(v) contains leaves of 1 < p < (k + 2)/4 cherries. From the construction, T r has 
both leaves of each cherry in different subtrees under the root rt(T')\ thus there is no 
matching cluster for C(v) in T' . □ 




(a) (b) 
Fig. 11. (a) Structure of mul-tree 71 and (b) A toll sequence of k leaves. 

We are now ready for the construction of 7i and 72- Figure 11 (a) | outlines the struc- 
ture of T\ . The solid rectangles represent toll sequences of k uniquely labeled leaves 



(Fig. |ll(b)| ). The left side of 7i has n triangles one for each of the n elements in S. 
Each triangle represents a tree T corresponding to S{ G S, connecting through its root. 
The right side of T\ has n sets of 3 triangles corresponding to the subsets in C; for 
each subset C{ = {s^ , s i2J s i3 }, the triangles represent three trees T's, corresponding 
to each s$ (for 1 < j < 3), connected through their roots. 

72 has the similar structure except that 72 has tree T' for each si G S and tree T for 
each element of d G C (for 1 < i < n). Thus, 72 has T's on the left side and Ts on 
the right side, which is opposite to what T\ has. 



4 Two leaves connected with the same internal vertex in a tree are called a cherry. 
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Lemma 7. Mul-trees 7i and T2 can be constructed in polynomial time. 

Proof. Trees T and T' are rooted binary trees on k + 2 leaves. T and T' can be con- 
structed in polynomial time, and so the 4n copies of each (for 7i and %). Further, 2n 
toll sequences in for each T\ and 72) can be constructed in polynomial time. There are 
constant number of rest of the vertices in T\ and T2 • Hence, the Lemma. □ 

Here is the connection between exactly covering S and transforming 71 into T2 by 
contractions and refinements: To transform T\ into 72, all we need is to convert each 
tree T on the left into T' and each tree T' on the right into T. From Lemma[6j this costs 
2Aqk contractions and refinements. A rather clever technique is to swap 3q Ts on the 
left with their counterparts on the right and to transform the remaining 6q T's on the 
right into Ts. If an exact cover d r , Ci q of S exists, we can partition the 3q Ts into 
q groups according to the cover. For each Cj (j = i q ) in the cover, we swap the 

corresponding group of trees for sequences Sj 1 , Sj 2 , Sj 3 with their counterparts. 

Lemma 8. All T's for each Cj (j = ii, ...,i q ) can be swapped with corresponding Ts 
by 2(k + 1) contractions and refinements. 

Proof. Take the toll sequence corresponding to Cj and contract its k + 1 edges; i.e., 
(k — 1) internal edges and 2 edges at both the sides of the toll sequence. Now refine 
it so that corresponding Ts move in Cj and T's stay in the left. This takes 2(k + 1) 
contractions and refinements. □ 

From Lemma [8] if the exact cover of S exists, then 6q trees can be transformed by 
2q(k + 1) contractions and refinements. Remaining 6q T's can be transformed into Ts 
by 12qk contractions and refinements. Hence, we have the following lemma. 

Lemma 9. If set S has an exact cover then the RF distance between 71 and T2 is 
= 2q(k + 1) + 12kq. 

If there is no exact cover of S, then either more than 6q trees (T or T') are trans- 
formed separately or more than q group swaps are performed. The construction guar- 
antees that both cases will cost more than the cost of transforming (71 into T2) in exact 
cover case. Hence, we conclude the following. 

Theorem 5. Set S has no exact cover if and only if the RF distance between 7i and T2 
is more than k — 2q(k + 1) + 12kq. 

Other Proofs 

Proof (Theorem^. Let the given input mul-tree T is such that T := (T, M, cp). We 
prove the Theorem by showing that for each a G M, where \ip~ 1 (a)\ = k, all k\ ways 
of uniquely relabeling corresponding k leaves in both T and S result into the same 
number of matched and unmatched splits in the corresponding mutually consistent full 
differentiations. The set of splits in T can be divided into two categories: 

- Category 1: Splits that have all the leaves labeled with a in one part. Such a split 
will always have a match irrespective of the labeling. 
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- Category 2: The remaining splits. Such splits are not present in <S, therefore, they 
will never have a match irrespective of the labeling. □ 

Proof (Observation 1). Let the extension of S be S := (T', M', <p'). Let S be a full 
differentiation of S that is consistent with T, where T' and S are isomorphic under 
bijectionr : V(T') -+ V(S). 

For Z e {X, Y} 9 let S[Z] = {I e C(S) : ^(r"^/)) e £(Z)}. 

Since, £(S|£ (T ))nS[Z] = 0, RF (S\ C (t), T) = ^(Sj^^, T). Now, RF(S, T) = 
RF(S,T) =i?F(S| jC(T ),T) =^F(Sj £(T) ,T) = itF(S 7 ,T) = AF(5 7 , 7^. □ 

Proof (Lemma^. 

RF($",T) = \C(T)\ - \I(T)\ -2 + 2|^| 

= |£(T)|-|/(T)|-2 

+ 2|{txG/(T):/s//(tx) = 0}| 

= |£(T)|-|J(T)|-2 + 2|^| 

-2|{wGir:/s/W = 0&/s//(w)>l}| 
+ 2\{ueH:fo,(u) = 0kfv(u)>l}\ 

= i?F(S / ,T)-2|G|+2|L| 

□ 

Proo/ (Lemma^. The RF distance computation for S, obtained by pruning Y and 
regrafting at a leaf in X, can be done in &(n) time. After 5, the RF distance for each 
tree 5", obtained by regrafting Y on each edge in X, can be computed in constant 
time by performing regrafts in the order of M There are 0(n) edges in H, thus the RF 
computation for all the trees can be done in 0{n) time. The same argument applies for 
pruning X and regrafting on the edges in Y. □ 

Proof (Theorem^. There are 0(n) internal edges in S. For each edge {x,y} in S, 
where X, Y be two resulting subtrees containing x,y, respectively. The RF distance for 
all the trees obtained by regrafting X (or Y) on each edge in Y (or X) can be computed 
in 0{n) time from Lemma [5] Thus for k input trees the RF distance can be checked in 
0(nk) time. The total time over all 0(n) internal edges is 0(n 2 k). □ 



